Shotgun Metagenomic Data Analysis ◾ 309
-1 fastq_pure/ERR1823608_pure_R1.fastq.gz \
-2 fastq_pure/ERR1823608_pure_R2.fastq.gz \
-0 /dev/null -s /dev/null -n
The saved FASTQ files contain the reads of the metagenomic data after removing the con-
taminating human sequences.
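Since each FASTQ record occupies exactly four lines, you can sanity-check the filtered files by counting reads directly: the number of reads equals the line count divided by four. A minimal sketch on a toy file (on the real data you would pass fastq_pure/ERR1823608_pure_R1.fastq.gz instead of the toy file):

```shell
# Count reads in a gzipped FASTQ: one record = four lines,
# so reads = total lines / 4.
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' | gzip > toy.fastq.gz
echo $(( $(gzip -cd toy.fastq.gz | wc -l) / 4 ))   # prints the read count: 2
rm toy.fastq.gz
```

Running the same command on both the R1 and R2 files should give identical counts; a mismatch means the pairing has been broken.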
If you run FastQC on those files, you will find that the read lengths vary between 35
and 151 bp. We can remove any read pair in which either read is shorter than 50 bp.
Removing such reads from paired-end FASTQ files requires a program or a script that
removes the reads from both files of a pair without leaving singletons, which may be
rejected by some programs used in the downstream analysis. Thus, we need to filter reads
by length using the right tool. Several programs can do this, but here we will use a bash
script. First, change into "fastq_pure" and run the following loop to decompress the
FASTQ files:
cd fastq_pure
for i in *.gz;
do
gzip -d ${i}
done
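Before writing the filtering script, you can verify the read-length range yourself; the awk one-liner below reproduces the minimum and maximum that FastQC reports. A minimal sketch on a toy FASTQ (for the real data, substitute one of the decompressed ERR1823608 files):

```shell
# Report the minimum and maximum read length in a FASTQ file.
# NR%4==2 selects the sequence line of each four-line record.
printf '@r1\nACGTACG\n+\nIIIIIII\n@r2\nACG\n+\nIII\n' > toy.fastq
awk 'NR%4==2 { n=length($0);
               if (min=="" || n<min) min=n;
               if (n>max) max=n }
     END { print "min="min, "max="max }' toy.fastq   # prints: min=3 max=7
rm toy.fastq
```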
Then, open a text editor of your choice, such as vim or nano, and save the following bash
script in a file named "remove_PE.sh":
vim remove_PE.sh
Then, copy the bash script into the file, save it, and exit:
#!/bin/sh
#Use: remove_PE.sh R1.fastq R2.fastq 80
#1. Start with inputs
fq_r1=$1
fq_r2=$2
minLength=$3
#2. Find all entries with read length less than minimum length and
#   print line numbers, for both R1 and R2
awk -v min=$minLength \
'{if(NR%4==2) \
if(length($0)<min) \
print NR"\n"NR-1"\n"NR+1"\n"NR+2}' \
$fq_r1 > temp.lines1
awk -v min=$minLength \
'{if(NR%4==2) \
if(length($0)<min) \
print NR"\n"NR-1"\n"NR+1"\n"NR+2}' \
$fq_r2 >> temp.lines1
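To see what this awk step emits, consider a toy run (the file toy.fastq and the 5 bp cutoff below are illustrative only, not part of the script):

```shell
# Demonstrate the line-number logic: for each sequence line (NR%4==2)
# shorter than min, print the line numbers of all four lines of that record.
printf '@r1\nACGTACG\n+\nIIIIIII\n@r2\nACG\n+\nIII\n' > toy.fastq
awk -v min=5 '{if(NR%4==2) if(length($0)<min) \
print NR"\n"NR-1"\n"NR+1"\n"NR+2}' toy.fastq
# The second record's sequence sits on line 6 and is only 3 bp long,
# so the output is the line numbers 6, 5, 7, 8 (that whole record).
rm toy.fastq
```

Because R1 and R2 have matching record positions, appending both files' line numbers into one list lets the script later delete the same records from both files, avoiding singletons.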